Subject: Open Source Technologies
Teacher: Lendák Imre Dr
Student: Maksim Kumundzhiev
Neptun Code: V249C6
In this assignment we are going to use 2 different approaches:
Thumbnail sketch:
#Imports
import sys
from yahoo_finance_api2 import share
from yahoo_finance_api2.exceptions import YahooFinanceError
from IPython import display
import pandas as pd
import plotly.graph_objects as go
from datetime import datetime
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import warnings
warnings.filterwarnings("ignore")
from scipy.stats import norm
from scipy.stats import kurtosis
1) Choose which data we would like to parse. In our case it is the price ratio of Bitcoin to the US dollar (BTC-USD).
2) Parse the data from the web application.
- Additionally, we can tune:
- range of days: share.PERIOD_TYPE_DAY
- frequency: share.FREQUENCY_TYPE_MINUTE
3) We get a dictionary (symbol_data) with the following keys:
- dict_keys(['timestamp', 'open', 'high', 'low', 'close', 'volume'])
4) After that, we can create a DataFrame from the dictionary using pd.DataFrame.from_dict
#Executed version for parsing data --> NOT STREAM
#Choose which data we would like to parse. In our case it is the price ratio of Bitcoin to the US dollar
my_share = share.Share('BTC-USD')
symbol_data = None
#Parse the data from the web application.
#Additionally, we can tune:
# - range of days: share.PERIOD_TYPE_DAY
# - frequency: share.FREQUENCY_TYPE_MINUTE
try:
    #Request 10 days of history sampled every 60 minutes
    symbol_data = my_share.get_historical(share.PERIOD_TYPE_DAY,
                                          10,
                                          share.FREQUENCY_TYPE_MINUTE,
                                          60)
except YahooFinanceError as e:
    print(e.message)
    sys.exit(1)
#We get a dictionary (symbol_data) with the following keys:
#dict_keys(['timestamp', 'open', 'high', 'low', 'close', 'volume'])
#print(symbol_data.keys())
#print(symbol_data['open'])
#After that, we can create a DataFrame from the dictionary using pd.DataFrame.from_dict
data = pd.DataFrame.from_dict(symbol_data)
We can see the default DataFrame with financial data for BTC-USD for 10 days.
- The first issue is the timestamp column: it holds raw epoch values, which are hard for people to read and awkward to visualise
data.head()
We can fix it with a simple manipulation: keep the first 10 digits of each value (the seconds part of the epoch timestamp) and pass them to datetime.fromtimestamp:
- datetime.fromtimestamp(int(str(data['timestamp'][index])[:10]))
Ex:
data['timestamp'] = [datetime.fromtimestamp(int(str(data['timestamp'][index])[:10])) for index in range(len(data['timestamp']))]
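The same conversion can be sketched with pandas directly. This is only an illustration: it assumes the timestamps are Unix epochs in milliseconds (which the 10-digit truncation above suggests), and the two sample values are made up. Note that pd.to_datetime returns UTC-based values, whereas datetime.fromtimestamp uses the local time zone, so the two results can differ by the local UTC offset.

```python
import pandas as pd

# Hypothetical sample: two epoch timestamps in milliseconds, one hour apart.
sample = pd.DataFrame({"timestamp": [1575550800000, 1575554400000]})

# unit="ms" tells pandas that the integers count milliseconds since 1970-01-01 UTC.
sample["timestamp"] = pd.to_datetime(sample["timestamp"], unit="ms")

print(sample)
```

The vectorised call avoids the Python-level loop and the string slicing, at the cost of fixing the time zone to UTC.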
We can additionally check our data for NaN values.
- Do it using data.info(), which reports the non-null count for every column
data.info()
data.head()
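If the non-null counts differ between columns, the missing values can be counted and removed explicitly. The sketch below uses a small made-up frame rather than the downloaded data; whether dropping rows is the right treatment for this assignment is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing close price.
sample = pd.DataFrame({
    "open": [7400.0, 7410.0, 7425.0],
    "close": [7410.0, np.nan, 7430.0],
})

# Count the missing values in every column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Keep only the rows that have no missing value at all.
cleaned = sample.dropna()
print(len(cleaned))
```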
First we will plot the graph for the first 5 candles only. Because the frequency is 60 minutes, this covers 5 hours rather than 5 days (the number can be adjusted by changing this_value in the slice [:this_value])
- On the X axis we can observe the time range
- On the Y axis we can observe the price range
- On the graph we can observe the usual candles with 5 parameters: [Open, Close, Low, High, Date and Time]
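#Slice every series to the same first 5 rows so that x and the price values stay aligned
subset = data[:5]
fig = go.Figure(data=[go.Candlestick(x=subset['timestamp'],
                                     open=subset['open'],
                                     high=subset['high'],
                                     low=subset['low'],
                                     close=subset['close'])])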
#fig.update_layout(xaxis_rangeslider_visible=False)
fig.show()
Let us have a look at the full period of time:
- On the graph we can observe a sharp jump on the 5th of December at approximately 13:00
fig = go.Figure(data=[go.Candlestick(x=data['timestamp'],
                                     open=data['open'],
                                     high=data['high'],
                                     low=data['low'],
                                     close=data['close'])])
#fig.update_layout(xaxis_rangeslider_visible=False)
fig.show()
It is also worth playing around with the graph interactively!
from IPython.display import Video
Video("/Users/macbook/Desktop/Candle.m4v", embed=True,width=800, height=300)
Next we will plot a similar graph ourselves:
- Use default matplotlib.pyplot;
- Add the different columns from the DataFrame (Open, Close ...) one by one;
As we can observe, we got a similar graph, but we customized it a little with the following steps:
- The High feature is green
- The Low feature is red
- The Open feature is blue with triangle markers
- The Close feature is yellow with star markers
- On the Y axis the price is formatted in dollars
- On the X axis the dates have a format similar to the previous graph
fig=plt.figure(figsize=(16, 8));
fig.show();
ax=fig.add_subplot(111);
formatter = ticker.FormatStrFormatter('$%1.2f')
ax.yaxis.set_major_formatter(formatter)
for tick in ax.yaxis.get_major_ticks():
    tick.label1.set_visible(True)
    tick.label1.set_color('green')
ax.plot(data['timestamp'], data['open'], c='b', marker="^",ls='--',label='Open Price');
ax.plot(data['timestamp'],data['close'], c='y',marker=(8,2,0),ls='--',label='Close Price');
ax.plot(data['timestamp'],data['high'], c='g',ls='-',label='High Price');
ax.plot(data['timestamp'],data['low'], c='r',ls='-',label='Low Price');
plt.grid(True)
plt.legend(loc=1)
plt.draw()
Next, let us look at some additional observations:
- make sure we have data for 10 days
print(data['timestamp'].min(), data['timestamp'].max())
print(data['timestamp'].max() - data['timestamp'].min())
Let us look at the differences between the Open and Close prices for the first day and the last day:
- we can see that the prices increased by 82 and 81 dollars respectively
fig=plt.figure(figsize=(10, 8));
fig.show();
ax=fig.add_subplot(111);
formatter = ticker.FormatStrFormatter('$%1.2f')
ax.yaxis.set_major_formatter(formatter)
for tick in ax.yaxis.get_major_ticks():
    tick.label1.set_visible(True)
    tick.label1.set_color('green')
ax.plot(data['timestamp'].min(), data[data['timestamp']==data['timestamp'].min()]['open'], c='b', marker="^",ls='--',label='Min Open Price');
ax.plot(data['timestamp'].min(),data[data['timestamp']==data['timestamp'].min()]['close'], c='g',marker=(8,2,0),ls='--',label='Min Close Price');
ax.plot(data['timestamp'].max(), data[data['timestamp']==data['timestamp'].max()]['open'], c='b', marker="^",ls='--',label='Max Open Price');
ax.plot(data['timestamp'].max(),data[data['timestamp']==data['timestamp'].max()]['close'], c='g',marker=(8,2,0),ls='--',label='Max Close Price');
plt.grid(True)
plt.legend(loc=1)
plt.draw()
#Select the rows with the earliest and the latest timestamp once
first = data.loc[data['timestamp'].idxmin()]
last = data.loc[data['timestamp'].idxmax()]
print('Open and Close Prices for first day and last day are: \n {0}, \n {1}, \n {2}, \n {3}'.format(first['open'], first['close'], last['open'], last['close']))
print('Total difference between Open and Close Prices for first day and last day is: \n Open - {0} \n Close - {1}'.format(int(first['open']) - int(last['open']), int(first['close']) - int(last['close'])))
Now let us create some useful columns that help us draw some inferences about the stock
- we will create the column ‘Daily Lag’, which is just the ‘close’ price shifted by one row (one 60-minute interval at the frequency chosen above, despite the name)
- we will create the column 'Daily Returns', the relative change between consecutive ‘close’ prices
data['Daily Lag'] = data['close'].shift(1)
data.head()
#Simple return: current close relative to the previous close
data['Daily Returns'] = (data['close']/data['Daily Lag']) -1
data.head()
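For reference, the conventional simple return divides the current close by the previous close and subtracts one. A minimal sketch on made-up prices (the three values are hypothetical) shows the shift-based computation next to the pandas shortcut pct_change:

```python
import pandas as pd

# Hypothetical close prices for three consecutive intervals.
prices = pd.Series([100.0, 110.0, 99.0], name="close")

# Previous close for every row; the first row has no predecessor and becomes NaN.
lagged = prices.shift(1)

# Simple return: current close relative to the previous close.
manual_returns = prices / lagged - 1

# pandas computes the same quantity directly.
builtin_returns = prices.pct_change()

print(manual_returns)
```

A rise from 100 to 110 gives a return of about +10%, and the fall from 110 to 99 gives about -10%.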
Let us have a look at ‘Daily Returns’
data['Daily Returns'].hist(bins=20);
data['Daily Returns'].hist(bins=20)
plt.axvline(data['Daily Returns'].mean(), color='red',linestyle='dashed',linewidth=2)
#to plot the std line we plot both the positive and negative values
plt.axvline(data['Daily Returns'].std(), color='g',linestyle='dashed',linewidth=2)
plt.axvline(-data['Daily Returns'].std(), color='g',linestyle='dashed',linewidth=2);
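Next, let us look at kurtosis. It describes the ‘fatness’ of the tails of the distribution, which matters because it indicates how ‘extreme’ the values can get. Note that pandas reports excess kurtosis, so a normal distribution scores about 0.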
data['Daily Returns'].kurtosis()
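To get a feeling for the scale of this number, the sketch below compares two synthetic samples: one drawn from a normal distribution and one from a heavy-tailed Student t distribution. The samples, the seed, and the choice of 3 degrees of freedom are illustrative assumptions and have nothing to do with the BTC-USD data.

```python
import numpy as np
import pandas as pd

# Fixed seed so that the illustration is reproducible.
rng = np.random.default_rng(0)

# Thin-tailed reference sample and a heavy-tailed sample of the same size.
normal_sample = pd.Series(rng.normal(size=10_000))
heavy_sample = pd.Series(rng.standard_t(df=3, size=10_000))

# pandas returns excess kurtosis (Fisher definition, normal distribution = 0).
normal_kurt = normal_sample.kurtosis()
heavy_kurt = heavy_sample.kurtosis()

print(f"normal: {normal_kurt:.2f}, heavy-tailed: {heavy_kurt:.2f}")
```

A value close to 0 suggests tails similar to a normal distribution, while a clearly positive value points to more frequent extreme returns.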
And finally, let us again have a separate view of each feature:
data['close'].plot(figsize=(10, 7))
# Define the label for the title of the figure
plt.title("Close Price", fontsize=16)
# Define the labels for x-axis and y-axis
plt.ylabel('Price', fontsize=14)
plt.xlabel('Time', fontsize=14)
# Plot the grid lines
plt.grid(which="major", color='k', linestyle='-.', linewidth=0.5)
plt.show()
data['open'].plot(figsize=(10, 7))
# Define the label for the title of the figure
plt.title("Open Price", fontsize=16)
# Define the labels for x-axis and y-axis
plt.ylabel('Price', fontsize=14)
plt.xlabel('Time', fontsize=14)
# Plot the grid lines
plt.grid(which="major", color='k', linestyle='-.', linewidth=0.5)
plt.show()
data['high'].plot(figsize=(10, 7))
# Define the label for the title of the figure
plt.title("High Price", fontsize=16)
# Define the labels for x-axis and y-axis
plt.ylabel('Price', fontsize=14)
plt.xlabel('Time', fontsize=14)
# Plot the grid lines
plt.grid(which="major", color='k', linestyle='-.', linewidth=0.5)
plt.show()
data['low'].plot(figsize=(10, 7))
# Define the label for the title of the figure
plt.title("Low Price", fontsize=16)
# Define the labels for x-axis and y-axis
plt.ylabel('Price', fontsize=14)
plt.xlabel('Time', fontsize=14)
# Plot the grid lines
plt.grid(which="major", color='k', linestyle='-.', linewidth=0.5)
plt.show()
data['volume'].plot(figsize=(10, 7))
# Define the label for the title of the figure
plt.title("Volume", fontsize=16)
# Define the labels for x-axis and y-axis
plt.ylabel('Volume', fontsize=14)
plt.xlabel('Time', fontsize=14)
# Plot the grid lines
plt.grid(which="major", color='k', linestyle='-.', linewidth=0.5)
plt.show()
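The five cells above differ only in the column name and the labels, so they could be folded into one helper. The function plot_columns below is a hypothetical name introduced for this sketch, and the small frame stands in for the real data.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_columns(df, columns):
    """Draw one line chart per column, stacked vertically (hypothetical helper)."""
    fig, axes = plt.subplots(len(columns), 1, figsize=(10, 4 * len(columns)),
                             sharex=True, squeeze=False)
    for ax, column in zip(axes[:, 0], columns):
        ax.plot(df.index, df[column])
        ax.set_title(column.capitalize(), fontsize=16)
        # Volume is a count, every other column is a price.
        ax.set_ylabel("Volume" if column == "volume" else "Price", fontsize=14)
        ax.grid(which="major", color="k", linestyle="-.", linewidth=0.5)
    axes[-1, 0].set_xlabel("Time", fontsize=14)
    return fig, axes[:, 0]


# Made-up values standing in for the downloaded DataFrame.
sample = pd.DataFrame({
    "open": [1.0, 2.0, 3.0],
    "close": [1.5, 2.5, 2.0],
    "volume": [10, 20, 15],
})
fig, axes = plot_columns(sample, ["open", "close", "volume"])
```

With the real data the call would be plot_columns(data, ['close', 'open', 'high', 'low', 'volume']).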
- Parse data from the Yahoo web application using an approach written from scratch (Stream)
#Imports
import bs4
import requests
from bs4 import BeautifulSoup
import logging
import datetime
from multiprocessing import Process, current_process
from multiprocessing import Pool
import os
Write functions for parsing all these features from: http://in.finance.yahoo.com/quote/FB?p=FB
Below there are 6 functions which parse the data
#The limit in `while datetime.datetime.now().second < 20:` should be a global value
#Execute ray.shutdown() if ray.init() has already been called once
#ray.shutdown()
# if ray.is_initialized() == True:
#     ray.shutdown()
# ray.init(local_mode=True)
#@ray.remote
def parse_price(link_for_company):
    #Collect the current price repeatedly until the end of the current minute
    process_id = os.getpid()
    price = []
    while datetime.datetime.now().second < 59:
        r = requests.get(link_for_company)
        soup = bs4.BeautifulSoup(r.text, 'html.parser')
        price.append(soup.find_all('div', {'class' : 'My(6px) Pos(r) smartphone_Mt(6px)'})[0].find('span').text)
    print(f'Process ID: {process_id}')
    return price
#@ray.remote
def parse_close(link_for_company):
    #Collect the previous close value repeatedly until the end of the current minute
    process_id = os.getpid()
    Close = []
    while datetime.datetime.now().second < 59:
        r = requests.get(link_for_company)
        soup = bs4.BeautifulSoup(r.text, 'html.parser')
        Close.append(soup.find_all('td', {'class' : 'Ta(end) Fw(600) Lh(14px)', 'data-test' : 'PREV_CLOSE-value'})[0].find('span').text)
    print(f'Process ID: {process_id}')
    return Close
#@ray.remote
def parse_open(link_for_company):
    #Collect the open value repeatedly until the end of the current minute
    process_id = os.getpid()
    Open = []
    while datetime.datetime.now().second < 59:
        r = requests.get(link_for_company)
        soup = bs4.BeautifulSoup(r.text, 'html.parser')
        Open.append(soup.find_all('td', {'class' : 'Ta(end) Fw(600) Lh(14px)', 'data-test' : 'OPEN-value'})[0].find('span').text)
    print(f'Process ID: {process_id}')
    return Open
#@ray.remote
def parse_bid(link_for_company):
    #Collect the bid value repeatedly until the end of the current minute
    process_id = os.getpid()
    Bid = []
    while datetime.datetime.now().second < 59:
        r = requests.get(link_for_company)
        soup = bs4.BeautifulSoup(r.text, 'html.parser')
        variables = list((soup.find_all('td', {'class': 'Ta(end) Fw(600) Lh(14px)', 'data-test': 'BID-value'})[0].find('span').text.split(' ')))
        #Keep only the numeric parts of the bid text
        Bid.append([float(variable) for variable in variables if not variable.isalpha()])
    print(f'Process ID: {process_id}')
    return Bid
#@ray.remote
def parse_volume(link_for_company):
    #Collect the volume value repeatedly until the end of the current minute
    process_id = os.getpid()
    Volume = []
    while datetime.datetime.now().second < 59:
        r = requests.get(link_for_company)
        soup = bs4.BeautifulSoup(r.text, 'html.parser')
        Volume.append(soup.find_all('td', {'class' : 'Ta(end) Fw(600) Lh(14px)', 'data-test' : 'TD_VOLUME-value'})[0].find('span').text)
    print(f'Process ID: {process_id}')
    return Volume
#@ray.remote
def parse_average_volume(link_for_company):
    #Collect the 3-month average volume repeatedly until the end of the current minute
    process_id = os.getpid()
    AverageVolume = []
    while datetime.datetime.now().second < 59:
        r = requests.get(link_for_company)
        soup = bs4.BeautifulSoup(r.text, 'html.parser')
        AverageVolume.append(soup.find_all('td', {'class' : 'Ta(end) Fw(600) Lh(14px)', 'data-test' : 'AVERAGE_VOLUME_3MONTH-value'})[0].find('span').text)
    print(f'Process ID: {process_id}')
    return AverageVolume
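Most of the functions above differ only in the data-test attribute they look for, so the lookup itself can be illustrated without any network access. In the sketch below, SAMPLE_HTML is a simplified, made-up stand-in for the page markup and extract_value is a hypothetical helper name; the real page structure may differ and may change over time.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup that imitates two cells of the quote table.
SAMPLE_HTML = """
<table>
  <tr><td data-test="OPEN-value"><span>201.60</span></td></tr>
  <tr><td data-test="PREV_CLOSE-value"><span>199.36</span></td></tr>
</table>
"""


def extract_value(html, data_test):
    """Return the text of the <td> whose data-test attribute matches, or None."""
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find("td", {"data-test": data_test})
    # Returning None instead of raising makes a missing field easy to detect.
    return cell.get_text(strip=True) if cell is not None else None


open_value = extract_value(SAMPLE_HTML, "OPEN-value")
missing_value = extract_value(SAMPLE_HTML, "BID-value")
print(open_value, missing_value)
```

Unlike indexing with [0] on the result of find_all, this variant does not raise an IndexError when the page layout changes and the cell is no longer found.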
parse_open(link_for_company='http://in.finance.yahoo.com/quote/FB?p=FB')
parse_close('http://in.finance.yahoo.com/quote/FB?p=FB')
As we can observe, we got the Open and Close price values for one minute.
But this is not very interesting, because they usually change only once per day.
- Let us have a look at the real price of the stock during the minute;
parse_price('http://in.finance.yahoo.com/quote/FB?p=FB')
Let us have a look at the differences in price during one minute;
prices = parse_price('http://in.finance.yahoo.com/quote/FB?p=FB')
plt.plot(prices);
Additionally, we can create a DataFrame using the parsed information
Open = parse_open('http://in.finance.yahoo.com/quote/FB?p=FB')
Close = parse_close('http://in.finance.yahoo.com/quote/FB?p=FB')
Bid = parse_bid('http://in.finance.yahoo.com/quote/FB?p=FB')
Avg = parse_average_volume('http://in.finance.yahoo.com/quote/FB?p=FB')
# #Define a dict for DataFrame
# dic = {'Open': Open, 'Close': Close, 'Bid': Bid, 'Price': prices, 'Average': Avg}
# #Define a df
# df = pd.DataFrame(columns = ['Open', 'Close', 'Bid', 'Price', 'Average'], data = dic)
Here we face the issue of different result lengths: the functions start parsing at different moments, but each of them stops at the end of the minute. I did this deliberately, in order to draw attention to asynchronous execution of the program;
#Check different lengths (of the collected lists, not of the key names)
dic = {'Open': Open, 'Close': Close, 'Bid': Bid, 'Price': prices, 'Average': Avg}
[len(values) for values in dic.values()]
There are lots of different solutions to this problem, such as:
- the multiprocessing library
- the ray library
- the threading library
- the asyncio library
- ...
One of these solutions will be used in the project.
# link_for_company = 'http://in.finance.yahoo.com/quote/FB?p=FB'
# list_of_functions = [parse_price, parse_close, parse_open, parse_bid, parse_volume, parse_average_volume]
# processes = []
# for function in list_of_functions:
#     process = Process(target=function, args=(link_for_company, ))
#     processes.append(process)
#     process.start()
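One way to start all parsers at the same moment and still collect their return values is a thread pool from the standard library. This is only a sketch of one of the options listed above, not the solution chosen for the project: make_fake_parser and run_concurrently are hypothetical names, and the fake parsers sleep instead of calling the network so that the example stays self-contained.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def make_fake_parser(name, samples):
    """Build a stand-in parser that returns `samples` fake readings (hypothetical)."""
    def parser(link_for_company):
        readings = []
        for i in range(samples):
            time.sleep(0.01)  # stands in for the latency of one HTTP request
            readings.append(f"{name}-{i}")
        return readings
    return parser


def run_concurrently(parsers, link_for_company):
    """Submit every parser at once and collect the results under the parser's name."""
    with ThreadPoolExecutor(max_workers=len(parsers)) as executor:
        futures = {name: executor.submit(func, link_for_company)
                   for name, func in parsers.items()}
        # result() blocks until the corresponding parser has finished.
        return {name: future.result() for name, future in futures.items()}


parsers = {
    "Open": make_fake_parser("open", 3),
    "Close": make_fake_parser("close", 3),
    "Price": make_fake_parser("price", 3),
}
results = run_concurrently(parsers, "http://in.finance.yahoo.com/quote/FB?p=FB")
print({name: len(values) for name, values in results.items()})
```

With the real parsing functions in place of the fake ones, all loops would start within the same second. Equal lengths are still not guaranteed, because each request can take a different amount of time.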